Today, we will be going through The Impact of Hurricane Maria on Out-migration from Puerto Rico: Evidence from Facebook Data. The GitHub Repository can be found here.The GitHub Repository for this tutorial can be found here.
In this paper, Alexander et al use Facebook advertising data to estimate migration of Puerto Ricans to the continental United States after Hurricane Maria. Given the large scale use of Facebook in Puerto Rico, they argue that using Facebook data is appropriate in this instance. They collect Facebook demographic data every 2-3 months (they have six waves in total) from January 2017 to March 2018. Hurricane Maria took place between September 2017 and October 2017, enabling the authors to see the effect of the hurricane on the population demographics. They use a difference-in-differences model to compare the size of the population before and after the hurricane, finding that between October 2017 and January 2018, there was a 17% increase in the number of migrants from Puerto Rico the the continental United States–with the most migrating to Florida. They also find that there were more male migrants than females, and there were more in younger working age groups. They also find that after the hurricane, 1.8% of migrants returned to Puerto Rico. The findings aligned with prior research.
A few examples of instances in which the Facebook Ads Manager has previously been used include:
In this R Tutorial, we will be using the data Alexander et al. provide on GitHub for replicability. We will not be extracting new data from the Facebook API. However, this is the general process:
For more information on how to extract data from the Facebook Ads Manager API using R see this guide.
You may find more information on how to extract FB data using the API & Python: here, here, and here.
First, we will download the necessary libraries.
## load libaries
library(kableExtra) # for making pretty tables
library(RColorBrewer) # for color palettes
library(lubridate) # for time data
library(rnaturalearth) # for world shapefile
library(sf) # simple feature mapping
library(mapview) # for simple interactive mapping
library(tmap) # for interactive mapping
library(tidyverse) # for data wrangling
#library(ggplot2) # for pretty graphs
library(viridis) # for plot colors
library(migest) # for migration visualization
library(dplyr) # for magritte
Next, we will load in the FB data from GitHub, wrangle the data into the format we want, and then create the different waves.
## load the raw facebook data
#fb.raw <- read_csv("./data/facebook/master_acs_facebook_data.csv") # if you've downloaded or cloned the github repo, otherwise
fb.raw <- read_csv("https://raw.githubusercontent.com/MJAlexander/fb-migration-hurricane-maria/master/data/facebook/master_acs_facebook_data.csv")
dim(fb.raw) # 25,200 rows, 30 columns
## clean & formatting the data
fb <- fb.raw %>%
# extract age group from character string (ex ages_15_19)
mutate(age_group = as.numeric(unlist(lapply(strsplit(age_group, "_"), '[[', 2)) ) ) %>%
#selecting relevant variables
select(state, origin, age_group, sex, expat_population_wave1:facebook_population_wave7) %>%
# add wave indicator
pivot_longer(5:18, names_to="var", values_to="value") %>%
mutate(wave = as.numeric(substr(var, str_length(var), str_length(var)) ),
var = substr(var, 1, str_length(var)-6)) %>%
pivot_wider(names_from=var, values_from=value) %>%
# "fix" state names (states with a space in the name, e.g. new york, are coded with an underscore in the name, e.g. new_york)
mutate(state = ifelse(grepl("_", state), str_replace(state, "_", " "), state))
## add in corresponding dates for each wave:
## requires lubridate package (part of the tidyverse, for date/time data, but need to separately load library)
#require(lubridate)
fb <- fb %>%
mutate(date = case_when(
wave==1~ymd(20170101),
wave==2~ymd(20170401),
wave==3~ymd(20170601),
wave==4~ymd(20171001),
wave==5~ymd(20180101),
wave==6~ymd(20180301),
wave==7~ymd(20180701)
))
## look at cleaned fb data
#dim(fb) # 176,400 rows, 8 columns: 7 waves x 2 sex x 9 age groups x 50 states x 28 origin countries
head(fb)# note the date format on the date column; not numerical but in data format
## # A tibble: 6 x 8
## state origin age_group sex wave expat_population facebook_population
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama Australia 15 2 1 30 120000
## 2 Alabama Australia 15 2 2 30 99000
## 3 Alabama Australia 15 2 3 30 97000
## 4 Alabama Australia 15 2 4 20 90000
## 5 Alabama Australia 15 2 5 20 81000
## 6 Alabama Australia 15 2 6 1000 86000
## # ... with 1 more variable: date <date>
The FB data is in long format, and has an observation (row) for each state-country of origin-sex-age group combination. There are a total of 176,400 rows: one for each combination of wave (7), sex (2), 5-year age groups (9), states (50; DC is not included), and country of origin (28).
Note: The data from waves 6 and 7 contain a significant number of “1000”s in the expat_population column (estimated number of migrants), which is suggestive of bottom-coding. In earlier waves, it appears the corresponding bottom-code count was “20”. It is possible the large numbers of “20”s in the earlier waves, and “1000”s in later waves, may be hiding other potential issues. In general, the FB expat_population estimates clearly involve some rounding.
Here are some descriptives and visualizations to better understand the clean data. Each new ‘wave’ of data is collected on the first of the month, every couple of months. As is evident from the table below, the number of migrants (called expat_population) is similar from Wave 1 to Wave 5. There is a decrease from Wave 5 to Wave 6. Wave 6 and Wave 7 are more similar to each other.
Because the data contains an observation for each state-country of origin-sex-age group combination, and there is evidence of bottom coding and data suppression, the estimates below are likely overestimates. This is especially obvious from the estimates for waves 6 and 7.
#looking at migrant count by date: helps us see what 'dates' are what 'waves'
mig_count <- fb %>% group_by(wave, date) %>% summarise(n = sum(expat_population))
kable(mig_count,
format.args=list(big.mark=','), col=c("Wave", "Date", "Count")) %>%
kable_paper()
| Wave | Date | Count |
|---|---|---|
| 1 | 2017-01-01 | 16,895,070 |
| 2 | 2017-04-01 | 16,760,020 |
| 3 | 2017-06-01 | 17,107,930 |
| 4 | 2017-10-01 | 15,406,610 |
| 5 | 2018-01-01 | 16,971,970 |
| 6 | 2018-03-01 | 37,461,200 |
| 7 | 2018-07-01 | 37,766,900 |
When we disaggregate the data by different waves, we also find that there are differences in the migration experiences of males and females. In general, there are slightly more male migrants than female migrants. This difference is a little more accentuated in Wave 1, Wave 3, Wave 5 and Wave 6.
To do this, we group by wave and sex, adding the number of migrants (expat_population) in the FB data set. We then use ggplot2 to create a bar graph to visualize these differences.
#looking at how many people who migrated were male vs female according to wave
sex_wave_comparison <- fb %>% group_by(wave, sex) %>% summarise(n = sum(expat_population))
#graphing waves
sex_wave_comparison$wave_factor<- factor(sex_wave_comparison$wave)
sex_wave_comparison$sex_character<- as.character(sex_wave_comparison$sex)
ggplot(sex_wave_comparison, aes(fill=sex_character, y=n, x=wave)) +
geom_bar(position="dodge", stat="identity") +
scale_fill_brewer(palette="Dark2") +
labs( x = "wave", y = "count", fill = "sex") +
coord_flip()+
scale_x_discrete(name = "wave", limits=c(1,2,3,4,5,6,7)) +
scale_fill_discrete(name = "sex", labels = c("male", "female"))
Now, we will examine the age patterns of migrants within the different waves. As per the table and graph below, of all the migrants in the Facebook data, there were more migrants in their mid-20s and mid-30s compared to younger (teenagers) or older (past 50) populations. All of the waves had similar age patterns.
#looking at how many people who migrated were of what age group generally
age_comparison <- fb %>% group_by(age_group, wave, date) %>% summarise(n = sum(expat_population))
kable(age_comparison %>%
mutate(wave = paste("wave", wave, date, sep="_")) %>%
select(-date) %>%
pivot_wider(names_from=wave, values_from=n),
format.args=list(big.mark=',')) %>%
kable_paper()
| age_group | wave_1_2017-01-01 | wave_2_2017-04-01 | wave_3_2017-06-01 | wave_4_2017-10-01 | wave_5_2018-01-01 | wave_6_2018-03-01 | wave_7_2018-07-01 |
|---|---|---|---|---|---|---|---|
| 15 | 945,970 | 884,480 | 901,210 | 660,940 | 770,730 | 3,229,600 | 3,209,900 |
| 20 | 2,308,190 | 2,227,480 | 2,257,480 | 1,876,490 | 2,068,380 | 4,252,300 | 4,247,500 |
| 25 | 2,768,310 | 2,730,640 | 2,780,030 | 2,472,540 | 2,739,000 | 4,874,200 | 4,953,300 |
| 30 | 2,657,730 | 2,639,980 | 2,699,240 | 2,450,700 | 2,653,520 | 4,847,900 | 4,885,000 |
| 35 | 2,475,630 | 2,458,660 | 2,519,720 | 2,351,390 | 2,538,210 | 4,763,900 | 4,796,100 |
| 40 | 2,034,990 | 2,032,440 | 2,080,290 | 1,945,490 | 2,091,520 | 4,374,400 | 4,409,000 |
| 45 | 1,646,890 | 1,674,260 | 1,710,910 | 1,626,080 | 1,796,720 | 4,066,800 | 4,128,900 |
| 50 | 1,177,110 | 1,200,870 | 1,232,050 | 1,155,020 | 1,305,970 | 3,652,400 | 3,701,600 |
| 55 | 880,250 | 911,210 | 927,000 | 867,960 | 1,007,920 | 3,399,700 | 3,435,600 |
ggplot(age_comparison, aes(x=age_group, y=n, color=factor(wave)))+
geom_line() +
#geom_bar(position="dodge", stat= "identity", fill="#69b3a2") +
labs(x = "Age", y = "Count", color="wave")
We will now use chord maps to map out migration patterns. When we compare the the migration patterns of those in the FB data set for Waves 4, 5, and 6. Hurricaine Maria occurs between Waves 4 and 5. Between Waves 4 and 5, we find a slight increase in Puerto Rican migrants to Florida. We see a slight decrease in migrants to California and Texas. Waves 5 and 6 are very similar abet an increased migration flow to Washington (we will look at this in detail later in this tutorial).
We restrict the origin to Puerto Rico, filtering by waves 4 & 5 and the states that are popular relocation sites.
Wave 4
#selecting relevant variables: you need these three for a chord chart
mig_flow4 <- fb %>% filter(wave==4, origin== "Puerto Rico", state=="Texas"|state=="Florida"|state=="New Jersey"|state=="New York"|state=="Pennsylvania"|state=="Connecticut"|state=="Illinois"|state=="California"|state=="Massachusetts"|state=="Washington") %>% select(origin, state, expat_population)
#putting our data into the mig_chord function
mig_flow4 %>% mig_chord()
Wave 5
mig_flow5 <- fb %>% filter(wave==5, origin== "Puerto Rico", state=="Texas"|state=="Florida"|state=="New Jersey"|state=="New York"|state=="Pennsylvania"|state=="Connecticut"|state=="Illinois"|state=="California"|state=="Massachusetts"|state=="Washington") %>% select(origin, state, expat_population)
mig_flow5 %>% mig_chord()
Wave 6
#selecting relevant variables: you need these three for a chord chart
mig_flow6 <- fb %>% filter(wave==6, origin== "Puerto Rico", state=="Texas"|state=="Florida"|state=="New Jersey"|state=="New York"|state=="Pennsylvania"|state=="Connecticut"|state=="Illinois"|state=="California"|state=="Massachusetts"|state=="Washington") %>% select(origin, state, expat_population)
#putting our data into the mig_chord function
mig_flow6 %>% mig_chord()
Alexander et al use FB data to estimate the size of the US population. From their estimations, there seems to be a general decrease in the coverage of the US population from Wave 1 to Wave 7, with an particular decrease in Wave 5. These estimations are clearly an undercount compared to population estimates by the US Census Bureau.
In 2017, the US population was around 325 million according to the US Census Bureau American Community Survey and Population Estimates Program.
In the code, Alexander et al filter by Mexico, but it doesn’t necessarily matter: all countries have the same total population. Here, we filter the clean data by origin (Mexico) and the regroup the data according to wave and date.
## estimate total facebook population by wave
# filtering by Mexico but it doesn't matter - all countries are the same total population
# this is the total (fb) population per state-age-sex-wave combination
total_fb_pop <- fb %>% filter(origin=="Mexico") %>% group_by(wave, date) %>% summarise(fb = sum(facebook_population))
#creating a table of the total fb pop
kable(total_fb_pop,
format.args=list(big.mark=','), col=c("wave", "date", "estimated US pop")) %>%
kable_paper()
| wave | date | estimated US pop |
|---|---|---|
| 1 | 2017-01-01 | 184,755,000 |
| 2 | 2017-04-01 | 173,047,000 |
| 3 | 2017-06-01 | 172,443,900 |
| 4 | 2017-10-01 | 169,208,700 |
| 5 | 2018-01-01 | 164,848,800 |
| 6 | 2018-03-01 | 171,468,000 |
| 7 | 2018-07-01 | 172,371,000 |
#using ggplot to visualize fb pop estimations by wave
ggplot(data=total_fb_pop, aes(x=wave, y=fb/1e6)) +
geom_line()+
geom_point()+
labs(title="Estimating coverage of the US Population Using FB data",x="Waves", y = "Population Estimate in millions") +
scale_x_continuous(name = "wave", breaks=seq(1:7)) +
theme(panel.grid.minor.x = element_blank())
We provide another way to estimate US population count using FB data here (the results are the same as above):
# alternatively
fb %>%
# remove duplicate entries
distinct(state, age_group, sex, wave, facebook_population, date) %>%
# calculate estimated population by wave
group_by(wave) %>%
summarize(facebook_population = sum(facebook_population), .groups='drop')
The FB data contains estimates of expats from 28 countries.
## unique expat groups (country of origin)
(fb.origins <- unique(fb$origin)) # there are only 28, as noted in the paper
## [1] "Australia" "Austria" "Canada" "China" "France"
## [6] "Germany" "Greece" "Hungary" "India" "Indonesia"
## [11] "Ireland" "Israel" "Italy" "Japan" "Malaysia"
## [16] "Mexico" "Nepal" "Philippines" "Poland" "Portugal"
## [21] "Puerto Rico" "Romania" "Russia" "Singapore" "South Korea"
## [26] "Spain" "UAE" "Vietnam"
In the FB data, we find the largest number of migrants in the US are of Mexican origins. This is true for all 7 waves (from January 2017 to July 2018). Notably, the number of Mexican migrants is significantly larger than the estimated migrant population from the other 27 origin countries.
There is a marked a marked difference in the number of migrants from Puerto Rico between Waves 3 and 4, 4 and 5, and beyond Wave 5.
Note: There is a difference in counts between Waves 5 and 6 (and a extreme similarity between Waves 6 and 7). This illustrates that FB data isn’t always stable: it can suddenly change. Alexander et al note that a limitation is the opaqueness of the Facebook algorithm or shift in privacy policies. For example, it is interesting to note that the Cambridge Analytica data scandal (where the personal data belonging to millions of Facebook users was collected without their consent) occurred in March 2018 (during Wave 6). Perhaps the increase in media attention regarding social media privacy policies spurred Facebook to change their algorithms, shift their estimates, or change how data is shared in the Ads manager (i.e. bottom-coding, rounding).
## size of expat groups in the us, according to the fb data
fb.expat <- fb %>% #filter(wave <= 5) %>% # the counts for waves 6 and 7 look very different when compared to wave 5
group_by(origin, wave) %>%
summarise(expat_pop = sum(expat_population), .groups='drop') %>%
arrange(wave, -expat_pop) %>%
mutate(wave = paste0("pop_wave", wave)) %>%
pivot_wider(names_from=wave, values_from=expat_pop)
fb.expat %>%
kable(format.args=list(big.mark=',')) %>%
kable_paper("hover")
| origin | pop_wave1 | pop_wave2 | pop_wave3 | pop_wave4 | pop_wave5 | pop_wave6 | pop_wave7 |
|---|---|---|---|---|---|---|---|
| Mexico | 8,499,220 | 8,382,780 | 8,556,120 | 7,941,350 | 8,488,080 | 8,992,300 | 9,167,700 |
| India | 1,672,300 | 1,734,900 | 1,762,700 | 1,603,500 | 1,718,990 | 2,186,400 | 2,225,300 |
| Philippines | 1,276,050 | 1,345,540 | 1,373,840 | 1,240,100 | 1,346,940 | 1,817,800 | 1,858,400 |
| Puerto Rico | 1,029,590 | 1,058,380 | 1,162,570 | 964,530 | 1,235,470 | 1,765,600 | 1,816,200 |
| China | 651,730 | 594,080 | 591,150 | 456,350 | 442,500 | 1,122,800 | 1,111,300 |
| Vietnam | 612,160 | 607,080 | 622,820 | 570,230 | 641,280 | 1,214,800 | 1,239,400 |
| Canada | 444,500 | 437,630 | 430,930 | 364,070 | 457,240 | 1,052,500 | 1,019,600 |
| South Korea | 354,600 | 330,420 | 331,800 | 282,540 | 328,890 | 1,014,300 | 1,010,800 |
| Germany | 284,280 | 249,940 | 250,470 | 213,850 | 225,320 | 926,300 | 926,000 |
| Japan | 209,980 | 199,340 | 198,580 | 171,820 | 212,840 | 944,400 | 942,200 |
| Russia | 197,470 | 187,830 | 189,040 | 171,750 | 203,440 | 939,900 | 944,800 |
| France | 193,460 | 173,720 | 171,420 | 140,670 | 151,120 | 929,500 | 929,300 |
| Italy | 182,790 | 165,330 | 167,910 | 141,410 | 167,720 | 919,100 | 925,800 |
| Poland | 174,920 | 169,380 | 169,030 | 159,560 | 184,070 | 958,300 | 958,900 |
| Indonesia | 156,280 | 123,860 | 126,190 | 92,340 | 111,970 | 914,300 | 919,300 |
| Nepal | 151,170 | 152,070 | 152,730 | 141,940 | 175,770 | 924,500 | 926,800 |
| Spain | 123,120 | 168,330 | 167,850 | 146,870 | 126,120 | 907,600 | 906,500 |
| Australia | 111,870 | 123,270 | 125,220 | 105,300 | 115,120 | 912,800 | 914,500 |
| Israel | 110,500 | 104,720 | 105,530 | 93,550 | 104,180 | 916,500 | 921,400 |
| Romania | 101,390 | 97,880 | 97,680 | 85,780 | 110,360 | 900,800 | 900,700 |
| Malaysia | 68,970 | 62,760 | 62,900 | 52,250 | 77,070 | 900,000 | 900,000 |
| Ireland | 67,540 | 68,100 | 68,090 | 62,510 | 63,220 | 900,300 | 900,800 |
| Portugal | 67,140 | 69,720 | 70,540 | 65,320 | 84,670 | 900,400 | 901,200 |
| Hungary | 38,030 | 37,900 | 38,010 | 34,650 | 36,490 | 900,000 | 900,000 |
| Singapore | 35,950 | 35,490 | 35,580 | 31,240 | 51,800 | 900,000 | 900,000 |
| UAE | 35,630 | 35,210 | 35,040 | 31,060 | 68,540 | 900,000 | 900,000 |
| Austria | 26,430 | 26,360 | 26,190 | 24,070 | 24,760 | 900,000 | 900,000 |
| Greece | 18,000 | 18,000 | 18,000 | 18,000 | 18,000 | 900,000 | 900,000 |
We now create a chord map to visualize the migration patterns of those in the FB data set. As is clearly evident from the chart, many migrants are coming from Mexico and India. California and New Jersey are popular relocation sites.
#selecting relevant variables: you need these three for a chord chart
# this is wave 4 only
mig_flow <- fb %>% filter(wave == 4 & origin %in% c("Mexico", "India", "Puerto Rico")) %>% group_by(state, origin) %>% summarize(expat_population = sum(expat_population), .groups='drop')
#putting our data into the mig_chord function
mig_flow %>% mig_chord()
Here, we use the FB data to see what it says about the migrant population in Washington. First, we estimate total migrant population. Then, we take a closer look at the migrant population in Washington by country of origin.
When estimating migrant population in Washington, we find a steady increase from Wave 1 to Wave 5. Then, there is a sharp increase between Wave 5 and Wave 6. This could suggest that the data may not be the most reliable for smaller groups, illustrating a potentially significant barrier in using FB data for research. As a result, we leave out waves 6 and 7 when estimating the migration population in Washington according to country of origin. While there are slight fluctuations over time, a majority of the migrants are coming from Mexico, India, Vietnam, China, and Canada. The number of migrants coming from Mexico to Washington are significantly larger than those from other countries.
## estimated wa population by wave, according to fb
wa.fb <- fb %>%
filter(state == "Washington") %>%
# remove duplicate entries
distinct(state, age_group, sex, wave, facebook_population, date) %>%
group_by(wave, date) %>%
summarize(facebook_population = sum(facebook_population), .groups='drop')
## estimated expat population in wa, according to fb
# notice there's a big jump in numbers between waves 5 and 6
fb %>%
filter(state == "Washington") %>%
group_by(wave) %>%
summarize(expat_pop = sum(expat_population)) %>%
# merge in fb population for wa
left_join(wa.fb, by="wave") %>%
mutate(expat.prop = round(expat_pop/facebook_population, 3)) %>%
relocate(date, .after=wave) %>%
kable(format.args=list(big.mark=','), caption="estimated population sizes in WA, based on FB advertising data",
col.names=c("wave", "date", "estimated WA expat pop", "estimated WA total pop", "proportion WA expat")) %>%
kable_paper("hover")
| wave | date | estimated WA expat pop | estimated WA total pop | proportion WA expat |
|---|---|---|---|---|
| 1 | 2017-01-01 | 459,180 | 4,250,000 | 0.108 |
| 2 | 2017-04-01 | 465,730 | 4,050,000 | 0.115 |
| 3 | 2017-06-01 | 465,890 | 4,040,000 | 0.115 |
| 4 | 2017-10-01 | 431,550 | 3,930,000 | 0.110 |
| 5 | 2018-01-01 | 445,950 | 3,790,000 | 0.118 |
| 6 | 2018-03-01 | 793,700 | 3,960,000 | 0.200 |
| 7 | 2018-07-01 | 811,500 | 3,890,000 | 0.209 |
## estimated expat population in wa by country of origin (waves 1-5 only)
fb %>%
filter(state == "Washington" & wave <= 5) %>% # looks ok(?) for can, cn, in, mx, ph, vn in waves 6, 7. there's a sizable increase in numbers between waves 5 and 6
# calculate number of expats by country of origin, by wave
group_by(wave, date, origin) %>%
summarize(expat_pop = sum(expat_population), .groups='drop') %>%
mutate(wave = paste0("wave", wave)) %>%
pivot_wider(names_from=c(wave, date), values_from=expat_pop) %>%
kable(format.args=list(big.mark=','),
caption = "estimated number of migrants in WA by country of origin, based on FB advertising data") %>%
kable_paper("hover")
| origin | wave1_2017-01-01 | wave2_2017-04-01 | wave3_2017-06-01 | wave4_2017-10-01 | wave5_2018-01-01 |
|---|---|---|---|---|---|
| Australia | 4,020 | 4,710 | 4,710 | 3,980 | 4,230 |
| Austria | 460 | 450 | 490 | 420 | 450 |
| Canada | 23,810 | 23,540 | 23,510 | 21,610 | 23,090 |
| China | 24,820 | 23,940 | 23,700 | 19,260 | 18,380 |
| France | 3,590 | 3,540 | 3,470 | 3,020 | 3,210 |
| Germany | 8,660 | 8,410 | 8,340 | 7,390 | 7,510 |
| Greece | 360 | 360 | 360 | 360 | 360 |
| Hungary | 520 | 600 | 610 | 510 | 550 |
| India | 56,710 | 60,390 | 60,730 | 57,440 | 60,200 |
| Indonesia | 5,730 | 4,980 | 4,920 | 3,880 | 4,530 |
| Ireland | 1,500 | 1,440 | 1,500 | 1,320 | 1,340 |
| Israel | 1,830 | 1,790 | 1,820 | 1,660 | 1,870 |
| Italy | 3,080 | 2,870 | 2,960 | 2,420 | 2,480 |
| Japan | 10,220 | 9,710 | 9,690 | 8,470 | 9,010 |
| Malaysia | 1,860 | 1,880 | 1,870 | 1,510 | 1,670 |
| Mexico | 190,800 | 194,100 | 193,600 | 186,100 | 190,600 |
| Nepal | 3,040 | 2,880 | 2,900 | 2,720 | 2,850 |
| Philippines | 46,920 | 51,890 | 51,680 | 46,920 | 49,640 |
| Poland | 1,870 | 1,830 | 1,790 | 1,600 | 1,730 |
| Portugal | 390 | 440 | 420 | 380 | 400 |
| Puerto Rico | 6,650 | 4,660 | 4,730 | 4,410 | 4,880 |
| Romania | 3,980 | 4,060 | 3,990 | 3,620 | 3,760 |
| Russia | 9,760 | 9,750 | 9,700 | 9,190 | 9,570 |
| Singapore | 1,180 | 1,200 | 1,210 | 1,050 | 1,150 |
| South Korea | 14,470 | 13,910 | 13,870 | 12,280 | 13,560 |
| Spain | 2,170 | 2,170 | 2,110 | 1,790 | 1,870 |
| UAE | 710 | 690 | 710 | 610 | 650 |
| Vietnam | 30,070 | 29,540 | 30,500 | 27,630 | 26,410 |
Maps are a great way to visualize migration patterns. Static maps are the most straight-forward way to showcase a map.
Here, we look at the migration experiences during Wave 4 (this is when Hurricane Maria took place). The maps visually show what we found earlier: that the estimated population size for migrants from Mexico is much larger than those from other countries. To attenuate the difference in the map, the population size is logged. It’s also important to remember that the FB data only covers 28 countries: it is not the case that there are no migrants from the entire African continent or Latin America. The visualization make this discrepancy more clear.
## plot the data on expat groups
# this is from ggplot2
world_map <- map_data("world")
#head(world_map) # this is lat/long point data
## map of fb data for october 2017
fb.pop.counts.w4 <- fb %>%
filter(wave == 4) %>%
group_by(origin, wave) %>%
summarize(pop = sum(expat_population), .groups='drop') %>%
arrange(pop)
fb.pop.counts.w4 %>%
# need to keep all country outline data
right_join(world_map,
by=c("origin"="region") ) %>%
# note that mexico has a large number while other countries are much smaller in comparison
# might need to log scale, or set manual breaks for comparison
ggplot() +
geom_polygon(aes(x=long, y=lat, group=group, fill=log(pop)), color="black") +
coord_fixed(1.3) +
scale_fill_distiller(na.value="white", direction=1, palette="Reds", name="estimated expats \n in log(population)") +
labs(subtitle="Country of origin of US migrants, estimated from FB data (Oct 2017)") +
theme_void() +
theme(legend.position="bottom")
Unlike static maps, interactive maps allow the user to zoom in and out, identify specific features, or query underlying data or indicator (i.e. socioeconomic status). Here, we created a quick and simple interactive map for exploratory data analysis on the spatial context, first using the mapview library, then the tmap library.
mapview
You can find more detailed instructions on how to create an interactive map with this mapview tutorial and the CRAN documentation.
## download world outline data (sf)
#require(rnaturalearth)
world <- ne_countries(scale="medium", returnclass="sf")
## create interactive map displaying fb estimates by country of origin
world %>%
sf::st_transform(crs=4326) %>%
# merge in fb population estimates to shapefile
left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>%
select(admin, name, pop_est, continent, region_un, subregion, geometry, wave, pop) %>%
mapview(zcol="pop", # which column to map
map.types=c("OpenStreetMap", "CartoDB.Positron", "Esri.WorldImagery"), # define which base map(s) to display
at=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop) ), # manually define categorical breaks
col.regions=brewer.pal(7, "YlGn") # specify color palette
)
Hovering over a country will give the estimated migrant population size from that country.
tmap
The tmap library can be used to create both static and interactive maps. Static maps are the default.
#require(tmap)
## create static tmap
tmap_options(check.and.fix = TRUE)
world %>%
sf::st_transform(crs=4326) %>%
left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>%
tm_shape() +
tm_fill(col="pop", # which column to plot
id="name", # what to display
breaks=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop)), # manually define breaks
palette="YlGn") + # specify color palette
tm_borders() # add in borders
To create an interactive map, specify the mode to view using tmap_mode("view"). (To toggle it back to static maps, use tmap_mode("plot").) The same code is used to generate the static and interactive maps.
## create interactive map displaying fb estimates by country of origin
tmap_mode("view") # interactive mode
#tmap_mode("plot") # static map (this is the default)
world %>%
sf::st_transform(crs=4326) %>%
left_join(fb.pop.counts.w4, by=c("admin"="origin")) %>%
tm_shape() +
tm_fill(col="pop", # which column to plot
id="name", # what to display
breaks=c(0, 5e4, 1e5, 2e5, 5e5, 1e6, 2e6, max(fb.pop.counts.w4$pop)), # manually define breaks
palette="YlGn") + # specify color palette
tm_borders() # add in borders
Clicking on the country will generate a popup that shows the estimated size of the migrant population from that country.
The code provided by Alexander et al on the GitHub repo indicates they downloaded the 2017 1-year ACS data with sex, age, birthplace (bpl), and state (statefip) variables from IPUMS. They then perform Some data cleaning (the code needed for cleaning the data is available on their Github repo).
We read in the pre-saved ACS 2017 1 year data here:
# read in pre-saved ACS 2017 1-year data
#acs_prop_age <- read_csv("./data/acs_prop_age.csv") # if cloned/downloaded the github repo, otherwise
acs_prop_age <- read_csv("https://raw.githubusercontent.com/MJAlexander/fb-migration-hurricane-maria/master/data/acs_prop_age.csv")
Alexander et al note that FB data is not representative of the broader population and it is difficult for the findings to be generalizable. However, the ACS is designed to be nationally representative. Therefore, the comparison of the FB estimates with the ACS estimates enables us to assess this potential bias.
It is important to note that the FB data encapsulates only 28 different countries. When looking at this data, we see that there are differences in the number of estimated migrants when compared to that of the ACS. There are 10 countries where the FB estimate is greater than the ACS estimate. For the remaining 18 countries, the ACS estimate is larger than the FB estimate. This indicates a misalignment between the data sources.
## national level estimates of number of "expats" in the acs, compared to fb estimates
acs.fb.comp <- acs_prop_age %>%
# calculate number of estimated migrants by country of birth
group_by(origin, bpl) %>%
summarize(acs.est = sum(no_mig), .groups='drop') %>%
select(-bpl) %>%
# keep only the countries available in the fb data
filter(origin %in% fb.origins) %>%
# merge in fb expat data estimates
right_join(fb.expat %>% select(1:5) %>% # waves 1-4 "technically" more comparable since covers 2017
rowwise() %>% mutate(fb.avg = mean(c_across(cols=2:5))) %>% ungroup() , # compute mean estimate over the 4 waves
by="origin") %>%
# compute difference between acs and fb estimates
mutate(diff = acs.est - fb.avg) %>%
arrange(diff) %>%
relocate(fb.avg, .after=acs.est) %>%
relocate(diff, .after=fb.avg)
kable(acs.fb.comp %>% select(1:4), format.args=list(big.mark=',')) %>%
kable_paper("hover")
| origin | acs.est | fb.avg | diff |
|---|---|---|---|
| Indonesia | 70,611 | 124,667.5 | -54,056.5 |
| Nepal | 97,594 | 149,477.5 | -51,883.5 |
| Spain | 102,006 | 151,542.5 | -49,536.5 |
| UAE | 16,201 | 34,235.0 | -18,034.0 |
| Australia | 101,345 | 116,415.0 | -15,070.0 |
| France | 157,959 | 169,817.5 | -11,858.5 |
| Hungary | 29,668 | 37,147.5 | -7,479.5 |
| Malaysia | 55,461 | 61,720.0 | -6,259.0 |
| Austria | 19,800 | 25,762.5 | -5,962.5 |
| Singapore | 30,028 | 34,565.0 | -4,537.0 |
| Italy | 168,926 | 164,360.0 | 4,566.0 |
| Ireland | 73,093 | 66,560.0 | 6,533.0 |
| Israel | 111,919 | 103,575.0 | 8,344.0 |
| Romania | 118,938 | 95,682.5 | 23,255.5 |
| Puerto Rico | 1,087,557 | 1,053,767.5 | 33,789.5 |
| Portugal | 106,827 | 68,180.0 | 38,647.0 |
| Greece | 82,858 | 18,000.0 | 64,858.0 |
| Philippines | 1,416,830 | 1,308,882.5 | 107,947.5 |
| Poland | 289,660 | 168,222.5 | 121,437.5 |
| Canada | 575,457 | 419,282.5 | 156,174.5 |
| Japan | 363,601 | 194,930.0 | 168,671.0 |
| Vietnam | 1,001,298 | 603,072.5 | 398,225.5 |
| South Korea | 817,988 | 324,840.0 | 493,148.0 |
| Germany | 757,517 | 249,635.0 | 507,882.0 |
| Russia | 803,607 | 186,522.5 | 617,084.5 |
| India | 2,541,235 | 1,693,350.0 | 847,885.0 |
| China | 1,917,860 | 573,327.5 | 1,344,532.5 |
| Mexico | 9,846,770 | 8,344,867.5 | 1,501,902.5 |
Note: in the table above, positive values indicate a larger ACS estimate and negative values suggest larger FB values.
## map the results, to see if potential regional differences
world %>% left_join(acs.fb.comp, by=c("admin"="origin")) %>%
select(admin, name, pop_est, continent, region_un, subregion, geometry, acs.est, fb.avg, diff) %>%
mapview::mapview(zcol="diff", # which column to map
map.types=c("OpenStreetMap", "CartoDB.Positron", "Esri.WorldImagery"), # define which base map(s) to display
at=c(min(acs.fb.comp$diff), -1e4, 0, 1e5, 5e5, 1e6, max(acs.fb.comp$diff)), # manually define categorical breaks
col.regions=brewer.pal(6, "RdBu")
)
When calculating the state-level differences, the authors exclude North Carolina from the analysis. They explain that it is in an anomaly. They also examined only states with 18,000+ PRs to avoid rounding issues. While the ACS and FB data have different migrant population estimates at the state level, most of them are comparable. However, the disparity between the two data sets are a slightly greater (indicating a greater misalignment) for Connecticut and New Jersey. Florida, New York, and Pennsylvania have the largest number of migrants coming in to the state.
# 4. Top states based on PR population --------------------------------
# table of expat populations and proportions in facebook and acs
# remove North Carolina because there seems to be an anomaly in the data
# filter to include only states above 18000 to avoid rounding issues.
pr.state <- fb %>%
# authors do not include north carolina in their analysis for some reason (??)
filter(!state %in% c("North Carolina")) %>%
# keep fb wave 4 data only, and origin country = pr
filter(wave==4, origin=="Puerto Rico") %>%
# merge in acs data on pr population
left_join(acs_prop_age %>%
filter(origin == "Puerto Rico"),
by=c("state", "origin", "age_group", "sex")) %>%
# calculate for each state:
group_by(state) %>%
summarise(expat_population = sum(expat_population), # total fb pr expat population
total_pop = sum(facebook_population), # total fb population
expat_population_acs = sum(no_mig), # total acs pr migrant pop
expat_proportion_acs = round(sum(prop_pop), 3) # proportion of migrants in state
) %>%
# calculate fb expat proportion (out of all fb users)
mutate(expat_proportion = round(expat_population/total_pop, 3)) %>%
select(state, expat_population, expat_population_acs, expat_proportion, expat_proportion_acs) %>%
# arrange in decreasing numbers of fb pr migrants by state
arrange(-expat_population)
# states where pr expat pop > 18000 (12, if north carolina not dropped)
pr.state %>%
filter(expat_population>18000) %>%
kable(format.args=list(big.mark=',')) %>%
kable_paper("hover")
| state | expat_population | expat_population_acs | expat_proportion | expat_proportion_acs |
|---|---|---|---|---|
| Florida | 304,100 | 302,931 | 0.028 | 0.052 |
| New York | 107,000 | 131,612 | 0.009 | 0.022 |
| Pennsylvania | 91,800 | 100,303 | 0.014 | 0.027 |
| Massachusetts | 71,800 | 88,683 | 0.020 | 0.042 |
| Texas | 57,160 | 52,574 | 0.004 | 0.006 |
| Connecticut | 47,220 | 63,956 | 0.028 | 0.059 |
| New Jersey | 42,640 | 78,691 | 0.012 | 0.029 |
| Illinois | 23,590 | 26,677 | 0.003 | 0.007 |
| Ohio | 22,820 | 25,581 | 0.004 | 0.007 |
| California | 19,500 | 23,873 | 0.001 | 0.002 |
| Georgia | 19,110 | 19,892 | 0.003 | 0.006 |
(Figure 1 from the paper)
When examining the age distribution of the migrants, it is evident that in all the states, the ACS has an older age distribution while the Facebook distributions are greater for the under-30 age groups. The ACS trends differ in all age groups.
# 7. age change -----------------------------------------------------------
## compare age distributions (figure 1 in the paper)
acs_prop_age %>%
# pr individuals in the acs only
filter(origin=="Puerto Rico") %>%
# merge in fb data on pr expats from wave 1 (jan 2017)
left_join(fb %>% filter(wave==1, origin=="Puerto Rico")) %>%
# calculate proportion of expats in each age group out of total fb pop per sex-state combination
group_by(state, sex) %>%
mutate(expat_prop = expat_population/sum(expat_population, na.rm = T)) %>%
ungroup() %>%
select(state, sex, age_group, prop_mig, expat_prop) %>%
# look at 9 selected states, males
filter(sex==1, state %in% c("Florida", "New York", "New Jersey",
"Pennsylvania", "Connecticut", "Illinois",
"California", "Texas", "Massachusetts")) %>%
ggplot(aes(age_group, prop_mig)) +
geom_line(lwd = 0.8) +
geom_line(aes(age_group, expat_prop), color = "red", lty = 2, lwd = 0.8) +
facet_wrap(~state)+
theme_bw(base_size = 14) +
ylab("proportion") + xlab("age group")
As is evident in the table below, when we use the Facebook Data, the states with larger Puerto Rican populations showcase greater male to female sex ratios. However, it is important to note that this does not hold true for Texas and California. The Facebook sex ratios are generally greater than the ACS sex ratios, indicating higher proportions of men than the ACS. The following Table is Table 1 in the paper. The visualization showcases the difference in sex ratios by data source for selected states.
# 8. sex ratios -----------------------------------------------------------
#https://stats.stackexchange.com/questions/21167/standard-error-of-a-ratio
## sex ratios in ACS vs fb, by state
sr.nat <- acs_prop_age %>%
mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>%
group_by(state, pr, sex) %>%
summarise( no_mig = sum( no_mig, na.rm = T)) %>%
group_by(state, pr) %>%
# acs sex ratio
summarise(sr_acs = no_mig[sex==1]/no_mig[sex==2]) %>%
group_by(state) %>%
filter(state %in% c("Florida", "New York", "New Jersey",
"Pennsylvania", "Illinois", "California", "Connecticut",
"Texas", "Massachusetts")) %>%
left_join(fb %>%
mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>%
mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>%
group_by(wave, state, pr, sex) %>%
summarise(expat_population = sum(expat_population, na.rm = T)) %>%
group_by(wave, state, pr) %>%
summarise(sr_fb = expat_population[sex==1]/expat_population[sex==2]) %>% # fb sex ratio (m:f)
filter(state %in% c("Florida", "New York", "New Jersey",
"Pennsylvania", "Illinois", "California", "Connecticut",
"Texas", "Massachusetts"), wave<6)) %>% ungroup() %>%
filter(pr==1, wave==1) %>% select(-pr, -wave)
## table comparing sex ratio in acs data vs fb data (table 1 in the paper)
sr.nat %>%
kable(digits=3, col.names=c("state", "ACS sex ratio", "Facebook sex ratio")) %>%
kable_paper()
| state | ACS sex ratio | Facebook sex ratio |
|---|---|---|
| California | 1.298 | 0.937 |
| Connecticut | 0.854 | 1.352 |
| Florida | 1.015 | 1.187 |
| Illinois | 1.094 | 1.143 |
| Massachusetts | 0.883 | 1.251 |
| New Jersey | 1.020 | 1.202 |
| New York | 0.905 | 1.194 |
| Pennsylvania | 0.979 | 1.175 |
| Texas | 1.067 | 1.013 |
## plot difference in sex ratio by data source, for select states
sr.nat %>%
pivot_longer(2:3, names_to="source", values_to="sr") %>%
separate(source, into=c(NA, "source")) %>%
arrange(source, sr) %>%
ggplot(aes(y=fct_reorder2(state, source, sr), # sort values
x=sr, color=source)) +
geom_point() +
geom_vline(xintercept=1, linetype=2) +
labs(x="sex ratio", y="state") +
scale_y_discrete(limits=rev) + # for reversing the order of the states in the plot
theme_bw() +
theme(legend.position="bottom")
Difference-in-differences models are panel data methods applied when we are attempting to understand the effect of a particular treatment. Difference-in-differences models are used when some samples are exposed to a treatment and others are not. In this study, Alexander et al used difference-in-differences to estimate the percent change in Puerto Rican migrants in the continental US after Hurricane Maria. They choose this method to overcome issues of non-representativness and fluctuations in the data over time (indpendent of migration events).
Difference-in-difference models haven been be used to study the effects of a particular policy, in papers focused on the economics of education, and labor economics.
In this paper, the diff-in-diff estimator is \[ \pi_{g} = \frac{P_{g}^{PR}(5) - P_{g}^{PR}(4)}{P_{g}^{PR}(4)} - \frac{P_{g}^{c}(5) - P_{g}^{c}(4)}{P_{g}^{c}(4)} \] where \(g\) is the group of interest (age, sex, country of origin, state of residence). The authors are interested in assessing the difference in the size of the migrant population from Puerto Rico between waves 4 and 5, relative to the change in the size of the migrant population from all other countries, sans Mexico.
The control group includes migrants from all other countries aside from Mexico, which is assumed to follow different migration motivations. Effectively, in using this estimator, we are comparing the change in PR migrant population in the US between waves with the change in population for all other migrants (but not Mexican immigrants).
Using the difference-in-difference model, Alexander et al find a 17% increase in the number of Puerto Rican migrants present in the continental US between October 2017 and January 2018.
exp_prop_usa <- fb %>%
# exclude migrants from puerto rico or mexico
filter(origin != "Puerto Rico", origin != "Mexico") %>%
# calculate number of expats in each wave
group_by(wave) %>%
summarise(expat = sum(expat_population)) %>%
# add in total number of fb users per wave
left_join(total_fb_pop, by="wave") %>%
# calculate proportion of expats out of total fb population
mutate(prop = expat/fb) %>%
# merge this information back into fb data
left_join(fb %>%
# calculate number of pr expats in the us, per wave
filter(origin=="Puerto Rico") %>%
group_by(wave) %>%
summarise(expat_pr = sum(expat_population)),
by="wave") %>%
# calculate proportions
mutate(prop_pr = expat_pr/fb, # proportion of pr expat users among total fb users
var_prop = prop*(1-prop)/(fb/1000), # binomial variance for all other migrants (excludes mx)
var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>% # binomial variance for pr migrants
# calculate diff in diff
mutate(prop_diff = (prop - lag(prop))/lag(prop), # calculate difference in proportion of all other migrants between waves
se_diff = sqrt(var_prop + lag(var_prop)),
prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr), # calculate difference in pr migrants
se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),
diff_in_diff = (prop_pr_diff - prop_diff)*100, # calculate diff-in-diff estimator, x100 to get pct
se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) # calculate se for diff-in-diff estimator
# pull out diff-in-diff estimator (+se, 95% ci) between waves 4 and 5
res <- exp_prop_usa %>% filter(wave==5) %>%
select(percent_increase = diff_in_diff, se = se_diff_diff) %>%
# calculate ~95% ci
mutate(ci_lower = percent_increase - (2*se),
ci_upper = percent_increase + (2*se))
res %>%
kable(digits=2) %>%
kable_paper()
| percent_increase | se | ci_lower | ci_upper |
|---|---|---|---|
| 17.03 | 0.07 | 16.88 | 17.18 |
Here, we take the estimate percent changes obtained from the difference-in-difference models and multiply them by the relevant populations reported in the ACS. This enables us to estimate the change in the number of migrants.
\[\hat{M}_{g} = \hat{\pi}_{g} \times P_{g}^{ACS}\] where \(\hat{M_{g}}\) is the estimated number of migrants in group \(g\), \(\hat{\pi}_{g}\) is the proportional change in PR migrants in group \(g\), and \(P_{g}^{ACS}\) is the estimated number of PR migrants in group \(g\) in the ACS.
acs_prop_age %>%
filter(origin=="Puerto Rico") %>%
# calculate mean number of migrants, using previously calculated percentage of movers and 95% bounds
summarise(mean = sum(no_mig)*res$percent_increase/100,
lower = sum(no_mig)*res$ci_lower/100,
upper = sum(no_mig)*res$ci_upper/100) %>%
kable(format.args=list(big.mark=",")) %>%
kable_paper()
| mean | lower | upper |
|---|---|---|
| 185,183.4 | 183,567.5 | 186,799.4 |
As is evident form the difference in difference estimates of in-migrants from Puerto Rico: a majority migrate to the states that have an existing population of Puerto Rican migrants post Hurricane Maria. This can indicate community based migration/chain migration. In the code, it is important to note that in expat_pop = 20 seems to be some form of filler data - it gets excluded from the data, but the reasoning behind why is not explained. (This may be due to limited documentation with the Facebook data.) Parallel to above, North Carolina is also excluded (there isn’t necessarily an explanation as to why).
# 5a. State by state analysis (table) ----------------------------------------------
## calculate approximate population per state, by state and wave, using fb data
total_fb_pop_state <- fb %>% filter(origin=="Mexico") %>%
group_by(wave, state) %>%
summarise(fb = sum(facebook_population), .groups='drop')
## calculate expat population proportions by waves
ex_props <- fb %>%
filter(!state %in% c("North Carolina")) %>% # exclude north carolina (?)
# exclude all expats from puerto rico, mexico, where expat_pop = 20
filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>20) %>%
# calculate for each state and fb wave, total number of expats from all origins besides mx, pr
group_by(state, wave) %>%
summarise(expat = sum(expat_population), .groups='drop') %>%
# merge back fb population by state
left_join(total_fb_pop_state,
by=c("state", "wave")) %>%
# calculate proportion of fb expats in each state
mutate(prop = expat/fb) %>%
group_by(state) %>%
# merge in data on pr migrants from the fb data
left_join(fb %>%
filter(origin=="Puerto Rico") %>%
# calculate number of pr expats in each state, by wave
group_by(state, wave) %>%
summarise(expat_pr = sum(expat_population), .groups='drop'),
by=c("state", "wave") ) %>%
# calculate for each state
mutate(prop_pr = expat_pr / fb, # proportion of pr expats out of total fb users
var_prop = prop*(1-prop)/(fb/1000), # variance for proportion of expats out of total fb users
var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>% # variance for proportion of pr users out of total fb users
# calculate diff in diff estimator for each state
# this will throw some warning messages for several states (ak, nd, sd, vt, wy)
# (from the huge difference in expat_pop between waves 5 and 6, 6 and 7)
group_by(state) %>%
mutate(prop_diff = (prop - lag(prop))/lag(prop), # difference in proportion of expats between waves
se_diff = sqrt(var_prop + lag(var_prop)),
prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr), # difference in pr expats between waves
se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)),
diff_in_diff = (prop_pr_diff - prop_diff)*100, # diff-in-diff estimator * 100 (to get percent)
se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) %>%
ungroup()
# table with percent increases and numbers
ex_props %>%
filter(expat_pr>18000, !is.na(diff_in_diff), wave==5) %>%
select(state, prop_pr, diff_in_diff, se_diff_diff) %>%
mutate(percent_increase = round(diff_in_diff, 1),
percent_lower = percent_increase - (2*se_diff_diff),
percent_upper = percent_increase + (2*se_diff_diff)) %>%
select(state, percent_increase, percent_lower, percent_upper) %>%
# arrange states by large -> small pct increase in pr expats
arrange(-percent_increase) %>%
# add in acs data
left_join(acs_prop_age %>%
filter(origin=="Puerto Rico") %>%
# calculate nubmer of pr migrants by state
group_by(state) %>%
summarise(mig = sum(no_mig))) %>%
# calculated estimated number of pr migrants in the us using fb pct estimate and acs pop est
mutate(estimated_number = round(mig*percent_increase/100),
number_lower = round(mig*percent_lower/100),
number_upper = round(mig*percent_upper/100)) %>%
select(-mig) %>%
arrange(-estimated_number) %>%
kable(format.args=list(big.mark=','), digits=2) %>%
kable_paper("hover")
| state | percent_increase | percent_lower | percent_upper | estimated_number | number_lower | number_upper |
|---|---|---|---|---|---|---|
| Florida | 21.6 | 20.91 | 22.29 | 65,433 | 63,342 | 67,525 |
| New York | 11.0 | 10.32 | 11.68 | 14,477 | 13,584 | 15,371 |
| Pennsylvania | 13.4 | 12.66 | 14.14 | 13,441 | 12,700 | 14,181 |
| Connecticut | 14.7 | 12.89 | 16.51 | 9,402 | 8,244 | 10,560 |
| Massachusetts | 10.1 | 8.82 | 11.38 | 8,957 | 7,824 | 10,090 |
| Texas | 10.8 | 10.37 | 11.23 | 5,678 | 5,452 | 5,904 |
| Ohio | 12.8 | 12.22 | 13.38 | 3,274 | 3,125 | 3,424 |
| Illinois | 9.9 | 9.15 | 10.65 | 2,641 | 2,441 | 2,841 |
| Georgia | 13.1 | 12.41 | 13.79 | 2,606 | 2,470 | 2,742 |
| New Jersey | 2.9 | 1.56 | 4.24 | 2,282 | 1,228 | 3,336 |
| California | 2.4 | 1.86 | 2.94 | 573 | 444 | 702 |
The largest increase of Puerto Rican migrants were in Florida.
# 5b. State by state analysis (map) ---------------------------------------
to_map <- ex_props %>%
# keep states/waves with over 18k pr fb expats
filter(expat_pr>18000) %>%
# keep results from wave 5 (estimates difference between waves 4, 5)
filter(!is.na(diff_in_diff), wave==5) %>%
select(state, prop_pr, diff_in_diff) %>%
mutate(percent_increase = round(diff_in_diff, 1)) %>%
select(state, percent_increase) %>%
arrange(-percent_increase)
## load us map outline (part of ggplot2)
states_map <- map_data("state")
# create map for illustrating percent increase by state (choropleth map - this is figure 2 in the paper)
states_map %>%
# merge in data on percent increase by state
left_join(to_map %>%
# format for merging with map data
mutate(region = str_to_lower(state)) ) %>%
ggplot() +
geom_polygon(aes(x = long, y = lat, fill = percent_increase, group = group), color = "black") +
coord_fixed(1.3) + # define aspect ratio
theme_void() + # removes lat/long axes
scale_fill_distiller(na.value = "white", direction = 1, palette = "Reds", name = "percent increase")
In terms of who migrated out of Puerto Rico after Hurricane Maria, there is an increase in the 15-29 year age range. There were fewer people above the age of 35 migrating out of Puerto Rico in a post-disaster context.
# pr pop vs all other migrant pop, according to fb data
# (non-pr includes mexico)
d_tot <-
fb %>%
mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>%
# change pr expat pop of those age 35-39, male in connecticut to 4300. as given in data: 20. in wave 3 the number is 4300
mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>%
# calculate total pr/non-pr migrant population for each age group, by wave
group_by(pr, age_group, wave, date) %>%
summarise(expat_population = sum(expat_population), .groups='drop') %>% # 7 waves x 9 age group x 2 migrant groups = 126 unique combinations
group_by(wave, pr) %>%
# calculate proportion of age group out of total fb expat (pr, non-pr) population in that wave (separate for pr, non-pr)
mutate(exp_prop = expat_population/sum(expat_population),
var_prop = exp_prop*(1-exp_prop)/expat_population*10) %>% # associated variance
ungroup()
## plot results (figure 3 in the paper, but with a confidence interval)
# line plot
d_tot %>%
group_by(age_group) %>%
summarise(did = ((exp_prop[wave==5&pr==1]-exp_prop[wave==4&pr==1])*100 - (exp_prop[wave==5&pr==0]-exp_prop[wave==4&pr==0])*100),
se_did = sqrt(sqrt(var_prop[wave==5&pr==1]+var_prop[wave==4&pr==1])^2 + sqrt(var_prop[wave==5&pr==0]+var_prop[wave==4&pr==0])^2),
lower = did - (2*se_did), upper = did + (2*se_did) ) %>%
ggplot(aes(age_group, did)) + geom_line() + geom_hline(yintercept = 0) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
xlab("Age group") + ylab("Change (percentage points)") +
theme_bw(base_size = 14)
There were larger numbers of men than women in all nine states: this did not dramatically change after the Hurricane. However, there were some states (i.e. Texas) that suggested an increase of male migrants than other states (i.e. Pennsylvania) that suggested an increase of female migrants.
## change in sex ratio between waves in the fb data, by state (table 3)
fb %>%
mutate(pr = ifelse(origin== "Puerto Rico", 1, 0)) %>%
mutate(expat_population = ifelse(state=="Connecticut"&age_group==35&wave==4&sex==1&pr==1, 4300, expat_population)) %>%
left_join(total_fb_pop_state) %>%
group_by(wave, state, pr, sex) %>%
summarise(expat_population = sum(expat_population, na.rm = T), # total expat_pop by combination
fb = fb[row_number()==1], # keep the first value in the fb column for each wave/sex combination
prop = expat_population/fb, # calculate pr expat pop as proportion of total fb pop
var_prop = prop*(1-prop)/(fb/1000)) %>% # calculate variance for the proportion
group_by(wave, state, pr) %>%
summarise(sr = expat_population[sex==1]/expat_population[sex==2], # calculate sex ratio in fb data
se = (prop[sex==1]/prop[sex==2])^2*(var_prop[sex==1]/(prop[sex==1]*100)^2 + var_prop[sex==2]/(prop[sex==2]*100)^2)) %>% # calculate associated se (see stacks link from sex ratio section)
group_by(state) %>%
# calculate difference in sex ratio between waves 4/5 and between pr, non-pr
mutate(sr_diff = (sr[wave==5&pr==1]-sr[wave==4&pr==1])/sr[wave==4&pr==1] - (sr[wave==5&pr==0]-sr[wave==4&pr==0])/sr[wave==4&pr==0],
# calculate se for difference in sex ratio
se_diff = sqrt(sqrt(se[wave==5&pr==1]^2+se[wave==4&pr==1]^2) + sqrt(se[wave==5&pr==0]^2+se[wave==4&pr==0]^2))) %>%
filter(state %in% c("Florida", "New York", "New Jersey",
"Pennsylvania", "Illinois", "Ohio", "Connecticut",
"Texas", "Massachusetts", "Georgia", "California"), wave ==4, pr==1) %>%
mutate(sex_ratio = round(sr,3), change = round(sr_diff,3),
lower = round(sr_diff - (2*se_diff),3),
upper = round(sr_diff + (2*se_diff),3)) %>%
select(state, sex_ratio, change, lower, upper) %>%
arrange(-change) %>%
kable() %>%
kable_paper()
| state | sex_ratio | change | lower | upper |
|---|---|---|---|---|
| Florida | 1.163 | 0.080 | 0.076 | 0.083 |
| Texas | 1.008 | 0.065 | 0.059 | 0.071 |
| Georgia | 1.140 | 0.018 | 0.006 | 0.031 |
| Connecticut | 1.352 | 0.014 | 0.004 | 0.024 |
| New Jersey | 1.187 | 0.014 | 0.005 | 0.023 |
| California | 0.962 | 0.011 | 0.000 | 0.021 |
| Massachusetts | 1.279 | 0.008 | 0.000 | 0.016 |
| New York | 1.215 | 0.006 | 0.000 | 0.011 |
| Ohio | 1.133 | 0.005 | -0.007 | 0.016 |
| Illinois | 1.127 | -0.002 | -0.013 | 0.008 |
| Pennsylvania | 1.186 | -0.053 | -0.059 | -0.047 |
In estimating return migration, Alexander et al specifically look at the decrease in the Puerto Rican population in the continental US after the hurricane. However, it doesn’t seem like there is enough information to suggest that an inference about where the migrants who originate from Puerto Rico move to after their time in the US is appropriate. While it is likely that it is possible that they return, this migration pattern is inferred.
Using this assumption the results indicate a 1.8 percent of Puerto Ricans that return to Puerto Rico after the hurricane.
Note: Authors recognize that there is a rounding issue after Wave 5, thus keeping only those records with expat_population over 1000)
# 6. Return migration -----------------------------------------------------
# national
ex_props_post_nat <- fb %>%
filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>1000) %>% # rounding issue after wave 5
group_by(wave) %>%
summarise(expat = sum(expat_population)) %>% # calculate number of expats per wave
# merge in total fb population in the us
left_join(total_fb_pop, by="wave") %>%
# calculate proportion of expats in the total fb population
mutate(prop = expat/fb) %>%
left_join(fb %>%
filter(origin=="Puerto Rico", expat_population>1000) %>%
# caclulate fb pr expat population by wave
group_by(wave) %>%
summarise(expat_pr = sum(expat_population)),
by="wave") %>%
mutate(prop_pr = expat_pr / fb, # proportion of pr expats out of total fb population
var_prop = prop*(1-prop)/(fb/1000), # variance of expats out of total fb population
var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>% # variance of pr expats out of total fb population
mutate(prop_diff = (prop - lag(prop))/lag(prop), # calculate difference between waves (all expats, -mx)
se_diff = sqrt(var_prop + lag(var_prop)), # se for the difference between waves
prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr), # calculate difference between waves, pr only
se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)), # se for the difference between waves
diff_in_diff = (prop_pr_diff - prop_diff)*100, # calculate diff-in-diff estimator (in percent)
se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) # se for diff-in-diff estimator (in percent)
res_return <- ex_props_post_nat %>%
# examine difference between waves 5 and 6
filter(wave==6) %>%
select(diff_in_diff, se_diff_diff) %>%
rename(percent_increase = diff_in_diff, se = se_diff_diff) %>%
mutate(ci_lower = percent_increase - (2*se), ci_upper = percent_increase + (2*se))
res_return %>%
kable(digits=2) %>%
kable_paper()
| percent_increase | se | ci_lower | ci_upper |
|---|---|---|---|
| -2.01 | 0.06 | -2.13 | -1.88 |
## use these percentages with acs data to estimate number of return migrants to pr (at national level)
acs_prop_age %>% filter(origin=="Puerto Rico") %>%
ungroup() %>%
summarise(mean = sum(no_mig)*res_return$percent_increase/100,
lower = sum(no_mig)*res_return$ci_lower/100,
upper = sum(no_mig)*res_return$ci_upper/100) %>%
kable(format.args=list(big.mark=','), caption="estimated number of return migrants") %>%
kable_paper()
| mean | lower | upper |
|---|---|---|
| -21,823.71 | -23,166.45 | -20,480.97 |
Making the same assumption as above, Alexander et al also find that there is a decrease in the Puerto Rican population in Florida and Texas (the states with the greatest in-migration during the hurricane). Other states like California and New York, saw an increase of Puerto Rican migrants during this time.
# by state
ex_props_post <- fb %>%
# all other expats, but pr and mx
filter(origin!="Puerto Rico", origin!= "Mexico", expat_population>1000) %>% # rounding issue after wave 5
group_by(state, wave) %>%
summarise(expat = sum(expat_population), .groups='drop') %>% # calculate number of expats in each state per wave
# merge back in total fb population
left_join(total_fb_pop_state) %>%
mutate(prop = expat/fb) %>% # calculate expat proportion out of total fb population
group_by(state) %>%
left_join(fb %>%
filter(origin=="Puerto Rico") %>%
group_by(state, wave) %>%
summarise(expat_pr = sum(expat_population))) %>% # number of pr expats per state
# calculate proportions
mutate(prop_pr = expat_pr / fb, # pr expats as a proportion of total fb population
var_prop = prop*(1-prop)/(fb/1000), # variance for all expats as proportion of total fb population
var_prop_pr = prop_pr*(1-prop_pr)/(fb/1000)) %>% # variance for pr expats as proportion of total fb pop
# calculate diff in diff for each state and wave
arrange(state, wave) %>%
group_by(state) %>%
mutate(prop_diff = (prop - lag(prop))/lag(prop), # difference between waves (all expats -mx, -pr)
se_diff = sqrt(var_prop + lag(var_prop)), # se for difference in "all" expats between waves
prop_pr_diff = (prop_pr - lag(prop_pr))/lag(prop_pr), # difference in pr expats between waves
se_diff_pr = sqrt(var_prop_pr + lag(var_prop_pr)), # se for difference in pr expats between waves
diff_in_diff = (prop_pr_diff - prop_diff)*100, # diff-in-diff estimator (in percent)
se_diff_diff = sqrt(se_diff^2 + se_diff_pr^2)*100) %>% # se for diff-in-diff estimator
ungroup()
# notice that only 33 states
ex_props_post %>%
# keep only state-wave combinations where total pr expats greater than 18k (leaves 14 states [if nc included])
# and diff-in-diff estimators from wave 6 (this is the difference b/w waves 5 and 6)
filter(expat_pr>18000, !is.na(diff_in_diff), wave == 6) %>%
select(state, prop_pr, diff_in_diff, se_diff_diff) %>%
mutate(percent_change = round(diff_in_diff, 1),
percent_lower = percent_change - (2*se_diff_diff),
percent_upper = percent_change + (2*se_diff_diff) ) %>%
select(state, percent_change, percent_lower, percent_upper) %>%
arrange(percent_change) %>%
# merge in acs estimates of pr indivdiuals in the us by state
left_join(acs_prop_age %>%
filter(origin=="Puerto Rico") %>%
group_by(state) %>%
summarise(mig = sum(no_mig))) %>% # calculate number of pr individuals in each state
# calculate estimated number of pr return migrants + associated ci
mutate(estimated_number = round(mig*percent_change/100),
number_lower = round(mig*percent_lower/100),
number_upper = round(mig*percent_upper/100)) %>%
# remove acs pr estimate per state
select(-mig) %>%
# arrange states in order by estimated number of pr return migrants
arrange(estimated_number) %>%
kable(format.args=list(big.mark=','), digits=3,
caption="estimated return migrants to Puerto Rico for select states") %>%
kable_paper()
| state | percent_change | percent_lower | percent_upper | estimated_number | number_lower | number_upper |
|---|---|---|---|---|---|---|
| Florida | -7.1 | -7.766 | -6.434 | -21,508 | -23,526 | -19,490 |
| Massachusetts | -4.5 | -5.524 | -3.476 | -3,991 | -4,899 | -3,083 |
| Connecticut | -3.6 | -5.043 | -2.157 | -2,302 | -3,226 | -1,379 |
| Texas | -3.5 | -3.908 | -3.092 | -1,840 | -2,055 | -1,625 |
| North Carolina | -7.5 | -7.987 | -7.013 | -1,442 | -1,535 | -1,348 |
| Pennsylvania | -1.4 | -2.027 | -0.773 | -1,404 | -2,033 | -776 |
| Ohio | -1.7 | -2.143 | -1.257 | -435 | -548 | -322 |
| New York | 0.4 | -0.241 | 1.041 | 526 | -318 | 1,371 |
| Illinois | 2.8 | 2.150 | 3.450 | 747 | 574 | 920 |
| Georgia | 6.4 | 5.859 | 6.941 | 1,273 | 1,165 | 1,381 |
| New Jersey | 2.3 | 1.154 | 3.446 | 1,810 | 908 | 2,712 |
| California | 8.5 | 7.965 | 9.035 | 2,029 | 1,902 | 2,157 |
| Virginia | 13.6 | 12.880 | 14.320 | 2,995 | 2,836 | 3,153 |
| Wisconsin | 24.5 | 23.930 | 25.070 | 3,049 | 2,978 | 3,120 |